Dataset citation:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
This report explores a dataset of 1,599 red wines described by 12 variables (11 chemical properties plus a quality grade). The additional column X in the output below is only a row index.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This sample data set is very tidy, and there are no missing values. Some variables have a wide spread: the mean of residual sugar is 2.539 but the maximum is 15.5, and the maximum of chlorides is 0.611 (almost 7 times the mean). Total sulfur dioxide ranges from 6 to 289.
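The "almost 7 times" figure can be checked directly from the printed summary. The sketch below only re-uses the maximum and mean values shown above; the object names are illustrative and not taken from the original analysis.

```r
# Maximum and mean values copied from the summary() output above.
max_vals  <- c(residual.sugar = 15.5,  chlorides = 0.611,   total.sulfur.dioxide = 289)
mean_vals <- c(residual.sugar = 2.539, chlorides = 0.08747, total.sulfur.dioxide = 46.47)

# How many times larger than the mean is the largest observation?
max_to_mean <- round(max_vals / mean_vals, 2)
print(max_to_mean)
```

All three variables have a maximum roughly six to seven times the mean, which is a hint that a few extreme wines stretch the upper end of each distribution.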
## x freq
## 1 3 10
## 2 4 53
## 3 5 681
## 4 6 638
## 5 7 199
## 6 8 18
Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). For this red wine sample, the lowest grade is 3 and the highest grade is 8. The majority of wines scored 5 or 6, in the middle of the range.
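As a quick cross-check, the quality column can be rebuilt from the frequency table alone. This is a minimal sketch that uses only the counts printed above (the variable names are my own), so it runs without the original CSV file.

```r
# Grade counts copied from the frequency table above.
grade_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild one quality score per wine from the counts.
quality <- rep(as.integer(names(grade_counts)), times = grade_counts)

n_wines    <- length(quality)             # total number of wines
share_5_6  <- mean(quality %in% c(5, 6))  # proportion graded 5 or 6
mean_grade <- mean(quality)               # comparable to the mean in summary()

cat("wines:", n_wines, "\n")
cat("share graded 5 or 6:", round(share_5_6, 3), "\n")
cat("mean grade:", round(mean_grade, 3), "\n")
```

The rebuilt vector has the same length and mean as the quality column summarised earlier, which suggests the frequency table and the summary describe the same data.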
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Most red wine samples are between 3.0 and 3.5 on the pH scale. In general, the pH of most wines is between 3 and 4; in this sample the lowest pH is 2.74 and the highest is 4.01.
Below, we subset the data by wine quality to compare the pH distribution of high-quality wines (first summary) with that of low-quality wines (second summary).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.270 3.289 3.380 3.780
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.380 3.384 3.500 3.900
The pH distribution varies with wine quality. Both distributions look roughly normal but differ in mean and spread: the mean pH of high-quality wines (3.289) is slightly lower than that of low-quality wines (3.384), and the low-quality wines cover a wider range (2.74 to 3.90, versus 2.88 to 3.78).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The density of most wines is very close to the density of water, and the distribution looks normal, with values ranging from 0.9901 to 1.0040.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol distribution is skewed to the right: the mean (10.42) is above the median (10.20) and the tail extends up to 14.9. The most frequent alcohol level is about 9.5%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Fixed acidity, which does not evaporate readily, is between 7 and 9 in most wines. Volatile acidity is between 0.3 and 0.7 in most wines. Around 150 wines have a high volatile acidity of above 0.8, which can lead to an unpleasant, vinegar taste.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid can add ‘freshness’ and flavour to wines. Most wine samples fall somewhere between 0.05 and 0.5, and only a few wines have citric acid of more than 0.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual sugar for these red wine samples is heavily skewed to the right: most wines are low in sugar, with a long tail of sweeter wines. Looking more closely at the range between 1 and 4, the most frequent values are between 1.8 and 2.3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides represent the amount of salt in the wine. In this red wine data set, the most frequent value of chlorides is about 0.1. To see the distribution more clearly, I limited the data to between 0 and 0.2, which shows a roughly normal distribution with a mean of around 0.08.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The free form of sulfur dioxide prevents microbial growth and the oxidation of wine; most wines have between 0 and 20.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
The total sulfur dioxide level in most wines varies between 0 and 100. To estimate the bound form of sulfur dioxide, I subtracted free sulfur dioxide from total sulfur dioxide and stored the result in a new variable called bound sulfur dioxide. About three-quarters of the bound sulfur dioxide values are below 40 (the third quartile is 39).
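The derived variable is a plain column subtraction. The sketch below applies it to the first ten rows only, using the values visible in the str() output near the top of the report; `first_rows` is an illustrative name, and the full data frame would be handled the same way.

```r
# First ten rows of the two columns, copied from the str() output above.
first_rows <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Bound form = total minus free.
first_rows$bound.sulfur.dioxide <-
  first_rows$total.sulfur.dioxide - first_rows$free.sulfur.dioxide

print(first_rows)
print(summary(first_rows$bound.sulfur.dioxide))

# Because bound is derived from total, the two track each other closely.
print(cor(first_rows$total.sulfur.dioxide, first_rows$bound.sulfur.dioxide))
```

Since the new column is computed from total sulfur dioxide, a high correlation between the two is expected by construction and says little about the wine itself.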
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates are a wine additive that can contribute to sulfur dioxide levels; most wines have between 0.5 and 1.
There are 1,599 red wines in the dataset with 12 variables: 11 chemical properties of the wine (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol) plus quality. The only categorical variable is quality, an ordinal grade stored as an integer. All other variables are continuous.
The main features in the data set are pH and quality. I’d like to know whether pH level determines the quality of wine. I suspect that some other variables, in combination, are also likely to help build a predictive model for grading wine quality.
Volatile acidity, citric acid, residual sugar and free sulfur dioxide likely contribute to the quality of the wine.
I created a new variable called bound sulfur dioxide by calculating the difference of total sulfur dioxide and free sulfur dioxide.
The distribution of residual sugar is heavily skewed to the right, so I subset the data to values between 0 and 4 to see the bulk of the distribution more clearly. The distribution of chlorides is also right-skewed, and I limited the data to between 0 and 0.2.
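The direction of skew can be read off the printed summaries without plotting. The sketch below is a rough check rather than a formal test: it compares mean with median and computes Bowley's quartile skewness, using only the quartiles and means shown earlier. The data frame `stats` and its column names are my own.

```r
# Quartiles and means copied from the summary() output above.
stats <- data.frame(
  variable = c("alcohol", "residual.sugar", "chlorides"),
  q1       = c(9.5,   1.9,   0.070),
  median   = c(10.2,  2.2,   0.079),
  mean     = c(10.42, 2.539, 0.08747),
  q3       = c(11.1,  2.6,   0.090)
)

# Sign convention: positive values point to a longer right tail.
stats$mean_minus_median <- stats$mean - stats$median

# Bowley (quartile) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).
stats$bowley <- (stats$q3 + stats$q1 - 2 * stats$median) / (stats$q3 - stats$q1)

print(stats)
```

For all three variables the mean sits above the median and the quartile skewness is positive, so the long tail is on the high side. A histogram of the full column remains the more reliable way to judge the shape.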
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## bound.sulfur.dioxide -0.08 0.10 0.07
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## bound.sulfur.dioxide 0.17 0.06 0.43
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## bound.sulfur.dioxide 0.96 0.10 -0.11 0.03 -0.22
## quality bound.sulfur.dioxide
## fixed.acidity 0.12 -0.08
## volatile.acidity -0.39 0.10
## citric.acid 0.23 0.07
## residual.sugar 0.01 0.17
## chlorides -0.13 0.06
## free.sulfur.dioxide -0.05 0.43
## total.sulfur.dioxide -0.19 0.96
## density -0.17 0.10
## pH -0.06 -0.11
## sulphates 0.25 0.03
## alcohol 0.48 -0.22
## quality 1.00 -0.21
## bound.sulfur.dioxide -0.21 1.00
From the correlation matrix, pH, residual sugar and free sulfur dioxide show almost no correlation with quality (-0.06, 0.01 and -0.05), but quality has a moderate negative relationship with volatile acidity (-0.39), a moderate positive relationship with alcohol (+0.48) and a weak positive relationship with citric acid (+0.23).
Apart from the strong positive relationship between total sulfur dioxide and bound sulfur dioxide, which follows from how bound sulfur dioxide was calculated, there are three strong relationships in this data set that I want to explore: a negative relationship (-0.50) between alcohol and density, a negative relationship (-0.55) between volatile acidity and citric acid, and a negative relationship (-0.54) between pH and citric acid.
## wqr$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wqr$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wqr$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wqr$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wqr$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wqr$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
## wqr$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wqr$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wqr$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wqr$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wqr$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wqr$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
## wqr$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wqr$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wqr$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wqr$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wqr$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wqr$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
The trend between quality and alcohol is clear: as the alcohol percentage increases, quality tends to improve. The relationship between quality and volatile acidity is negative, which means better quality is observed at lower volatile acidity. The slope is less steep between quality and citric acid, but a higher level of citric acid still goes with better wine quality.
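The per-grade summaries above have the layout that `by()` prints, so they were probably produced along the following lines. This is a sketch under that assumption, run on the first ten rows visible in the str() output rather than on the full data, so the numbers will not match the tables above.

```r
# First ten rows, copied from the str() output near the top of the report.
first_rows <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One summary() per quality grade.
print(by(first_rows$alcohol, first_rows$quality, summary))

# Group means for several variables at once.
group_means <- aggregate(cbind(alcohol, volatile.acidity, citric.acid) ~ quality,
                         data = first_rows, FUN = mean)
print(group_means)
```

`aggregate()` returns one row per grade, which is easier to compare across variables than three separate `by()` printouts.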
##
## Pearson's product-moment correlation
##
## data: wqr$density and wqr$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
When density increases, the alcohol level decreases.
##
## Pearson's product-moment correlation
##
## data: wqr$citric.acid and wqr$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Volatile acidity and citric acid also show a negative relationship. It is interesting to see that both variables are related to quality to some degree: volatile acidity has a negative relationship with quality and citric acid has a positive relationship with quality.
##
## Pearson's product-moment correlation
##
## data: wqr$pH and wqr$fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
It is not a surprise that the strongest relationship is between pH and fixed acidity, as the pH scale measures how acidic a substance is.
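The t statistics and confidence intervals in the three test outputs follow from the correlation and the sample size alone. The sketch below recomputes them with the standard formulas (t test for a zero correlation, Fisher z-transform for the interval); `pearson_inference` is a helper name introduced here for illustration, not a function from the original analysis.

```r
# Recompute the t statistic and confidence interval for a Pearson correlation,
# starting only from the printed correlation r and the sample size n.
pearson_inference <- function(r, n, conf_level = 0.95) {
  df <- n - 2
  t_stat <- r * sqrt(df / (1 - r^2))              # t statistic for H0: rho = 0
  z  <- atanh(r)                                  # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)                           # approximate standard error of z
  half_width <- qnorm((1 + conf_level) / 2) * se  # half-width on the z scale
  list(t = t_stat, df = df, conf_int = tanh(z + c(-1, 1) * half_width))
}

n <- 1599  # number of wines (df = 1597 in the outputs above)

density_alcohol <- pearson_inference(-0.4961798, n)
ph_fixed        <- pearson_inference(-0.6829782, n)

print(density_alcohol)
print(ph_fixed)
```

With 1,599 observations even a moderate correlation gives a very large t statistic, so the tiny p-values mainly confirm that the relationships are not zero; the size of r is the more informative number.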
There are only two variables showing a slightly stronger relationship with quality: a negative moderate relationship with volatile acidity (-0.39), a positive moderate relationship (+0.48) with alcohol. Surprisingly, the pH level has a very weak relationship with quality (-0.06).
There are six strong relationships in total: a negative relationship (-0.50) between alcohol and density; a negative relationship (-0.55) between volatile acidity and citric acid; a negative relationship (-0.68) between fixed acidity and pH; a negative relationship (-0.54) between pH and citric acid; a positive relationship (+0.67) between fixed acidity and density; and a positive relationship (+0.67) between fixed acidity and citric acid.
Total sulfur dioxide and bound sulfur dioxide naturally have a very strong relationship (+0.96), because bound sulfur dioxide was calculated by subtracting free sulfur dioxide from the total. The strongest relationship between measured variables is between pH and fixed acidity (-0.68), which is again quite reasonable, as pH falls when acidity rises.
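The list of strong relationships can be extracted mechanically from the correlation matrix. The sketch below copies the rounded values for six of the variables from the printed matrix and keeps every pair whose absolute correlation is at least 0.5; the 0.5 cut-off is my reading of "strong" as used here, not a threshold stated in the analysis.

```r
vars <- c("fixed.acidity", "volatile.acidity", "citric.acid",
          "density", "pH", "alcohol")

# Correlations copied from the printed matrix (six of the thirteen variables).
cm <- matrix(c(
   1.00, -0.26,  0.67,  0.67, -0.68, -0.06,
  -0.26,  1.00, -0.55,  0.02,  0.23, -0.20,
   0.67, -0.55,  1.00,  0.36, -0.54,  0.11,
   0.67,  0.02,  0.36,  1.00, -0.34, -0.50,
  -0.68,  0.23, -0.54, -0.34,  1.00,  0.21,
  -0.06, -0.20,  0.11, -0.50,  0.21,  1.00
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) when |r| reaches the threshold.
threshold <- 0.5
idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest relationships first.
print(strong_pairs[order(-abs(strong_pairs$r)), ], row.names = FALSE)
```

On the full data the same filter could be applied to the output of `cor()`, after dropping derived columns such as bound sulfur dioxide that are correlated with their source by construction.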
## [1] "Mean of density by quality cut"
## wqr$quality.cut: (0,4]
## [1] 0.9966887
## --------------------------------------------------------
## wqr$quality.cut: (4,6]
## [1] 0.9968673
## --------------------------------------------------------
## wqr$quality.cut: (6,10]
## [1] 0.9960303
## [1] "Mean of alcohol % by quality cut"
## wqr$quality.cut: (0,4]
## [1] 10.21587
## --------------------------------------------------------
## wqr$quality.cut: (4,6]
## [1] 10.25272
## --------------------------------------------------------
## wqr$quality.cut: (6,10]
## [1] 11.51805
High-quality wines (6,10] have a higher percentage of alcohol, with density spread over a wider range between 0.990 and 1.000. Low-quality wines (0,4], on the other hand, have a lower percentage of alcohol, with density in a narrower range between 0.995 and 1.000.
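The band labels in the output, such as (0,4] and (6,10], are what `cut()` produces with right-closed intervals, so the grouping was presumably built with breaks at 0, 4, 6 and 10. The sketch below applies that assumed binning to a quality column rebuilt from the frequency table shown earlier, to see how many wines fall into each band.

```r
# Rebuild the quality column from the frequency table shown earlier.
grade_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(grade_counts)), times = grade_counts)

# Right-closed bins: (0,4] low, (4,6] medium, (6,10] high.
quality_cut <- cut(quality, breaks = c(0, 4, 6, 10))

band_sizes <- table(quality_cut)
print(band_sizes)
```

The low and high bands are much smaller than the middle one, so the means reported for them rest on far fewer wines and should be compared with some caution.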
## [1] "Mean of fixed acidity by quality cut"
## wqr$quality.cut: (0,4]
## [1] 7.871429
## --------------------------------------------------------
## wqr$quality.cut: (4,6]
## [1] 8.254284
## --------------------------------------------------------
## wqr$quality.cut: (6,10]
## [1] 8.847005
## [1] "Mean of citric acid by quality cut"
## wqr$quality.cut: (0,4]
## [1] 0.1736508
## --------------------------------------------------------
## wqr$quality.cut: (4,6]
## [1] 0.2582638
## --------------------------------------------------------
## wqr$quality.cut: (6,10]
## [1] 0.3764977
Low-quality wines tend to have lower fixed acidity together with lower citric acid. High-quality wines have a combination of higher citric acid and higher fixed acidity.
## [1] "Mean of pH by quality cut"
## wqr$quality.cut: (0,4]
## [1] 3.384127
## --------------------------------------------------------
## wqr$quality.cut: (4,6]
## [1] 3.311296
## --------------------------------------------------------
## wqr$quality.cut: (6,10]
## [1] 3.288802
## [1] "Mean of citric acid by quality cut"
## wqr$quality.cut: (0,4]
## [1] 0.1736508
## --------------------------------------------------------
## wqr$quality.cut: (4,6]
## [1] 0.2582638
## --------------------------------------------------------
## wqr$quality.cut: (6,10]
## [1] 0.3764977
High-quality wines have a combination of lower pH and higher citric acid, while low-quality wines have higher pH and lower citric acid.
##
## Pearson's product-moment correlation
##
## data: wqr$ratio and wqr$quality
## t = 7.9077, df = 1597, p-value = 4.854e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1464861 0.2408427
## sample estimates:
## cor
## 0.1941134
There does not seem to be a strong relationship between the ratio of the free form of sulfur dioxide and the quality of the red wines: the correlation is only +0.19.
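The report does not show how the ratio variable was built. One plausible reading, and it is only an assumption, is free sulfur dioxide divided by total sulfur dioxide. The sketch below computes that ratio for the first ten rows visible in the str() output.

```r
# ASSUMPTION: "ratio" means free sulfur dioxide divided by total sulfur dioxide.
# Values are the first ten rows from the str() output at the top of the report.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Share of the total that is in the free form, between 0 and 1.
free_ratio <- free_so2 / total_so2
print(round(free_ratio, 3))
```

Under this definition the ratio is bounded between 0 and 1, which makes it comparable across wines with very different total sulfur dioxide levels.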
High-quality wines have a combination of lower pH and higher citric acid or a combination of higher citric acid and higher fixed acidity. Low-quality wines see a mix of higher pH and lower citric acid or a mix of lower fixed acidity and lower citric acid.
High-quality wines have a higher percentage of alcohol, but this does not go with any particular level of density. It is surprising to note that there is only a weak correlation (+0.19) between the ratio of the free form of sulfur dioxide and the quality of the red wines.
A total of 1,319 red wine samples (about 82%) are graded 5 or 6, and no wine samples are graded below 3 or above 8.
A moderate positive relationship (0.48) is observed between alcohol percentage and wine quality, which means wine quality tends to rise as alcohol percentage increases. It is notable that the trend is not strictly linear: the mean alcohol percentage is lower for quality 5 than for quality 4 (9.9% vs 10.27%).
There is a clear pattern: low-quality wines (0,4] tend to have lower fixed acidity and lower citric acid, while high-quality wines (6,10] combine high citric acid with high fixed acidity. More specifically, low-quality wines have a mean fixed acidity of 7.87 and a mean citric acid of 0.17, whereas high-quality wines have a mean fixed acidity of 8.85 and a mean citric acid of 0.38.
I found I didn’t need to do any data wrangling for this sample data set, as it is very tidy. To my surprise, pH level doesn’t have a strong correlation with quality. I also didn’t see any strong correlation between the ratio of the free form of sulfur dioxide and the quality of the red wines.
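The break in the trend is easy to locate from the per-grade means. The sketch below copies the mean alcohol values from the by-grade summaries earlier in the report and looks at the change from each grade to the next; the variable names are illustrative.

```r
# Mean alcohol per quality grade, copied from the by-grade summaries above.
mean_alcohol <- c("3" = 9.955, "4" = 10.27, "5" = 9.9,
                  "6" = 10.63, "7" = 11.47, "8" = 12.09)

# Change in mean alcohol from each grade to the next one up.
# Each element is named after the higher grade of the pair.
step_change <- diff(mean_alcohol)
print(round(step_change, 3))

# Grades at which the mean falls instead of rising.
print(names(step_change)[step_change < 0])
```

Only the step into grade 5 is negative, so the relationship is broadly increasing but not monotonic. Grades 3 and 4 contain few wines, which may partly explain the irregularity.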
In my bivariate analysis, I discovered alcohol and volatile acidity have a moderate relationship with quality. My multivariate analysis shows high-quality wines have a combination of lower pH and higher citric acid or a combination of higher citric acid and higher fixed acidity. On the other hand, low-quality wines see a mix of higher pH and lower citric acid or a mix of lower fixed acidity and lower citric acid.
However, I don’t think I can confidently say that a certain combination of variables is proven to produce good quality wines. It seems that the quality grading by experts isn’t based purely on the 12 variables provided in the data set. It would be interesting to analyse red wine quality by the year of production, the place of production, etc.